Abstract

We predict discourse segment boundaries 
from linguistic features of utterances, using 
a corpus of spoken narratives as data. We 
present two methods for developing seg- 
mentation algorithms from training data: 
hand tuning and machine learning. When 
multiple types of features are used, results 
approach human performance on an inde- 
pendent test set (both methods), and using 
cross-validation (machine learning). 

1 Introduction 

Many have argued that discourse has a global structure above the
level of individual utterances, and that linguistic phenomena like
prosody, cue phrases, and nominal reference are partly conditioned by
and reflect this structure (cf. (Grosz and Hirschberg, 1992; Grosz and
Sidner, 1986; Hirschberg and Grosz, 1992; Hirschberg and Litman, 1993;
Hirschberg and Pierrehumbert, 1986; Hobbs, 1979; Lascarides and
Oberlander, 1992; Linde, 1979; Mann and Thompson, 1988; Polanyi, 1988;
Reichman, 1985; Webber, 1991)). However, an obstacle to exploiting
the relation between global structure and linguistic devices in
natural language systems is that there is too little data about how
they constrain one another. We have been engaged in a study addressing
this gap. In previous work (Passonneau and Litman, 1993), we reported
on a method for empirically validating global discourse units, and on
our evaluation of algorithms to identify these units. We found
significant agreement among naive subjects on a discourse segmentation
task, which suggests that global discourse units have some objective
reality. However, we also found poor correlation of three untuned
algorithms (based on features of referential noun phrases, cue words,
and pauses, respectively) with the subjects' segmentations.

In this paper, we discuss two methods for developing segmentation
algorithms using multiple knowledge sources. In section 2, we give a
brief overview of related work and summarize our previous results.
In section 3, we discuss how linguistic features are coded and
describe our evaluation. In section 4, we present our analysis of the
errors made by the best performing untuned algorithm, and a new
algorithm that relies on enriched input features and multiple
knowledge sources. In section 5, we discuss our use of machine
learning tools to automatically construct decision trees for
segmentation from a large set of input features. Both the hand tuned
and automatically derived algorithms improve over our previous
algorithms. The primary benefit of the hand tuning is to identify new
input features for improving performance. Machine learning tools make
it convenient to perform numerous experiments, to use large feature
sets, and to evaluate results using cross-validation. We discuss the
significance of our results and briefly compare the two methods in
section 6.

Bellcore did not support the second author's work.

2 Discourse Segmentation

2.1 Related Work

Segmentation has played a significant role in much work on discourse.
The linguistic structure of Grosz and Sidner's (1986) tri-partite
discourse model consists of multi-utterance segments whose
hierarchical relations are isomorphic with intentional structure.
In other work (e.g., (Hobbs, 1979; Polanyi, 1988)), segmental
structure is an artifact of coherence relations among utterances, and
few if any specific claims are made regarding segmental structure per
se. Rhetorical Structure Theory (RST) (Mann and Thompson, 1988) is
another tradition of defining relations among utterances, and informs
much work in generation. In addition, recent work (Moore and Paris,
1993; Moore and Pollack, 1992) has addressed the integration of
intentions and rhetorical relations. Although all of these approaches
have involved detailed analyses of individual discourses or
representative corpora, we believe there is a need for more rigorous
empirical studies.

Researchers have begun to investigate the ability of humans to agree
with one another on segmentation, and to propose methodologies for
quantifying their findings. Several studies have used expert coders to
locally and globally structure spoken discourse according to the model
of Grosz and Sidner (1986), including (Grosz and Hirschberg, 1992;
Hirschberg and Grosz, 1992; Nakatani et al., 1995; Stifelman, 1995).
Hearst (1994) asked subjects to place boundaries between paragraphs of
expository texts, to indicate topic changes. Moser and Moore (1995)
had an expert coder assign segments and various segment features and
relations based on RST. To quantify their findings, these studies use
notions of agreement (Gale et al., 1992; Moser and Moore, 1995) and/or
reliability (Passonneau and Litman, 1993; Passonneau and Litman, to
appear; Isard and Carletta, 1995).

By asking subjects to segment discourse using a non-linguistic
criterion, the correlation of linguistic devices with independently
derived segments can then be investigated in a way that avoids
circularity. Together, (Grosz and Hirschberg, 1992; Hirschberg and
Grosz, 1992; Nakatani et al., 1995) comprise an ongoing study using
three corpora: professionally read AP news stories, spontaneous
narrative, and read and spontaneous versions of task-oriented
monologues. Discourse structures are derived from subjects'
segmentations, then statistical measures are used to characterize
these structures in terms of acoustic-prosodic features. Grosz and
Hirschberg's work also used the classification and regression tree
system CART (Breiman et al., 1984) to automatically construct and
evaluate decision trees for classifying aspects of discourse structure
from intonational feature values. Morris and Hirst (1991) structured a
set of magazine texts using the theory of (Grosz and Sidner, 1986),
developed a thesaurus-based lexical cohesion algorithm to segment
text, then qualitatively compared their segmentations with the
results. Hearst (1994) presented two implemented segmentation
algorithms based on term repetition, and compared the boundaries
produced to the boundaries marked by at least 3 of 7 subjects, using
information retrieval metrics. Kozima (1993) had 16 subjects segment a
simplified short story, developed an algorithm based on lexical
cohesion, and qualitatively compared the results. Reynar (1994)
proposed an algorithm based on lexical cohesion in conjunction with a
graphical technique, and used information retrieval metrics to
evaluate the algorithm's performance in locating boundaries between
concatenated news articles.

2.2 Our Previous Results 

We have been investigating a corpus of monologues collected and
transcribed by Chafe (1980), known as the Pear stories. As reported in
(Passonneau and Litman, 1993), we first investigated whether units of
global structure consisting of sequences of utterances could be
reliably identified by naive subjects. We analyzed linear
segmentations of 20 narratives performed by naive subjects (7 new
subjects per narrative), where speaker intention was the segment
criterion. Subjects were given transcripts, asked to place a new
segment boundary between lines (prosodic phrases)[1] wherever the
speaker had a new communicative goal, and to briefly describe the
completed segment. Subjects were free to assign any number of
boundaries. The qualitative results were that segments varied in size
from 1 to 49 phrases in length (Avg.=5.9), and the rate at which
subjects assigned boundaries ranged from 5.5% to 41.3%. Despite this
variation, we found statistically significant agreement among subjects
across all narratives on location of segment boundaries (p-values
ranged from .114 x 10^-6 to .6 x 10^-9).

We then looked at the predictive power of lin- 
guistic cues for identifying the segment boundaries 
agreed upon by a significant number of subjects. We 
used three distinct algorithms based on the distri- 
bution of referential noun phrases, cue words, and 
pauses, respectively. Each algorithm (NP-A, CUE- 
A, PAUSE-A) was designed to replicate the subjects' 
segmentation task (break up a narrative into con- 
tiguous segments, with segment breaks falling be- 
tween prosodic phrases). NP-A used three features, 
while CUE-A and PAUSE-A each made use of a single feature. The
features are a subset of those described in section 3.2.

1 We used Chafe's (1980) prosodic analysis.

.. Because he's looking at the girl.
        1 SUBJECT (non-boundary)
[.75] Falls over,
        5 SUBJECTS (boundary)
[1.35] uh there's no conversation in this movie.
        SUBJECTS (non-boundary)
There's sounds,
        SUBJECTS (non-boundary)
you know,
        SUBJECTS (non-boundary)
like the birds and stuff,
        SUBJECTS (non-boundary)
but there.. the humans beings in it don't say anything
        7 SUBJECTS (boundary)
[1.0] He falls over,

Figure 1: Excerpt from narr. 6, with boundaries.

To evaluate how well an algorithm predicted segmental structure, we
used the information retrieval (IR) metrics described in section 3.3.
As reported in (Passonneau and Litman, to appear), we also evaluated a
simple additive method for combining algorithms in which a boundary is
proposed if each separate algorithm proposes a boundary. We tested all
pairwise combinations, and the combination of all three algorithms.
No algorithm or combination of algorithms performed as well as humans.
NP-A performed better than the other unimodal algorithms, and a
combination of NP-A and PAUSE-A performed best. We felt that
significant improvements could be gained by combining the input
features in more complex ways rather than by simply combining the
outputs of independent algorithms.
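
The additive combination described above amounts to intersecting the
sets of boundary sites proposed by the separate algorithms. A minimal
sketch in Python follows; the function name and the site indices are
hypothetical and serve only as illustration, they are not outputs of
NP-A, CUE-A or PAUSE-A.

```python
def combine_additive(*proposals):
    """Return the sites proposed as boundaries by every algorithm."""
    sets = [set(p) for p in proposals]
    if not sets:
        return set()
    return set.intersection(*sets)


# Hypothetical indices of potential boundary sites proposed by each algorithm.
np_a = {2, 7, 12, 19}
cue_a = {2, 5, 7, 30}
pause_a = {2, 7, 8, 19}

print(sorted(combine_additive(np_a, pause_a)))         # [2, 7, 19]
print(sorted(combine_additive(np_a, cue_a, pause_a)))  # [2, 7]
```

Because a site survives only if all algorithms agree, this kind of
combination can only remove proposed boundaries, never add them.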

3 Methodology 

3.1 Boundary Classification 

We represent each narrative in our corpus as a sequence of potential
boundary sites, which occur between prosodic phrases. We classify a
potential boundary site as boundary if it was identified as such by at
least 3 of the 7 subjects in our earlier study. Otherwise it is
classified as non-boundary. Agreement among subjects on boundaries was
significant at below the .02% level for values of j >= 3, where j is
the number of subjects (1 to 7), on all 20 narratives.[2] Fig. 1 shows
a typical segmentation of one of the narratives in our corpus. Each
line corresponds to a prosodic phrase, and each space between the
lines corresponds to a potential boundary site. The bracketed numbers
will be explained below. The boxes in the figure show the subjects'
responses at each potential boundary site and the resulting boundary
classification. Only 2 of the 7 possible boundary sites are classified
as boundary.

3.2 Coding of Linguistic Features

Given a narrative of n prosodic phrases, the n-1 potential boundary
sites are between each pair of prosodic phrases Pi and Pi+1, i from 1
to n-1. Each potential boundary site in our corpus is coded using the
set of linguistic features shown in Fig. 2.

• Prosodic Features
  - before: +sentence.final.contour, -sentence.final.contour.
  - after: +sentence.final.contour, -sentence.final.contour.
  - pause: true, false.
  - duration: continuous.
• Cue Phrase Features
  - cue1: true, false.
  - word1: also, and, anyway, basically, because, but, finally,
    first, like, meanwhile, no, now, oh, okay, only, or, see, so,
    then, well, where, NA.
  - cue2: true, false.
  - word2: and, anyway, because, boy, but, now, okay, or, right, so,
    still, then, NA.
• Noun Phrase Features
  - coref: +coref, -coref, NA.
  - infer: +infer, -infer, NA.
  - global.pro: +global.pro, -global.pro, NA.
• Combined Feature
  - cue-prosody: complex, true, false.

Figure 2: Features and their potential values.

Values for the prosodic features are obtained by automatic analysis of
the transcripts, whose conventions are defined in (Chafe, 1980) and
illustrated in Fig. 1: "." and "?" indicate sentence-final
intonational contours; "," indicates phrase-final but not sentence
final intonation; "[X]" indicates a pause lasting X seconds; ".."
indicates a break in timing too short to be measured. The features
before and after depend on the final punctuation of the phrases Pi and
Pi+1, respectively. The value is '+sentence.final.contour' if "." or
"?", '-sentence.final.contour' if ",". Pause is assigned 'true' if
Pi+1 begins with [X], 'false' otherwise. Duration is assigned X if
pause is 'true', and a fixed default otherwise.

The cue phrase features are also obtained by automatic analysis of the
transcripts. Cue1 is assigned 'true' if the first lexical item in Pi+1
is a member of the set of cue words summarized in (Hirschberg and
Litman, 1993). Word1 is assigned this lexical item if cue1 is true,
'NA' (not applicable) otherwise.[3] Cue2 is assigned 'true' if cue1 is
true and the second lexical item is also a cue word. Word2 is assigned
the second lexical item if cue2 is true, 'NA' otherwise.
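
The boundary classification of section 3.1 and the automatic prosodic
and cue phrase coding can be sketched in Python as follows. This is an
illustration under stated assumptions, not the coding program used in
the study: the cue word set is taken from the word1 and word2 values
listed in Fig. 2 rather than from the full list in (Hirschberg and
Litman, 1993), the duration default of 0.0 and the 'NA' contour for
phrases without final punctuation are assumed, and all function names
are hypothetical.

```python
import re

# Assumption: cue words are the word1/word2 values listed in Fig. 2.
CUE_WORDS = {
    "also", "and", "anyway", "basically", "because", "boy", "but",
    "finally", "first", "like", "meanwhile", "no", "now", "oh", "okay",
    "only", "or", "right", "see", "so", "still", "then", "well", "where",
}
PAUSE = re.compile(r"^\[(\d*\.?\d+)\]")  # a phrase-initial pause such as [.75]


def classify_site(n_subjects, threshold=3):
    """Label a site from the number of subjects who marked it."""
    return "boundary" if n_subjects >= threshold else "non-boundary"


def contour(phrase):
    """Map the final punctuation of a phrase to a contour value."""
    last = phrase.rstrip()[-1:]
    if last in {".", "?"}:
        return "+sentence.final.contour"
    if last == ",":
        return "-sentence.final.contour"
    return "NA"  # assumption: not specified in the text


def code_site(p_i, p_next):
    """Code the site between phrases p_i and p_next (prosody and cues only)."""
    p_next = p_next.strip()
    m = PAUSE.match(p_next)
    pause = m is not None
    duration = float(m.group(1)) if pause else 0.0  # 0.0 is an assumed default
    words = re.findall(r"[a-z']+", PAUSE.sub("", p_next).lower())
    cue1 = bool(words) and words[0] in CUE_WORDS
    cue2 = cue1 and len(words) > 1 and words[1] in CUE_WORDS
    return {
        "before": contour(p_i),
        "after": contour(p_next),
        "pause": pause,
        "duration": duration,
        "cue1": cue1,
        "word1": words[0] if cue1 else "NA",
        "cue2": cue2,
        "word2": words[1] if cue2 else "NA",
    }


site = code_site(".. Because he's looking at the girl.", "[.75] Falls over,")
print(site)
print(classify_site(1), classify_site(5))
```

Applied to the first site of Fig. 1, the sketch yields the prosodic and
cue phrase values shown in Fig. 3; the noun phrase features are hand
coded and are therefore not computed here.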

Two of the noun phrase (NP) features are hand-coded, along with
functionally independent clauses (FICs), following (Passonneau, 1994).
The two authors coded independently and merged their results. The
third feature, global.pro, is computed from the hand coding. FICs are
tensed clauses that are neither verb arguments nor restrictive
relatives. If a new FIC (Cj) begins in prosodic phrase Pi+1, then NPs
in Cj are compared with NPs in previous clauses and the feature values
assigned as follows:[4]

1. coref = '+coref' if Cj contains an NP that corefers with an NP in
   Cj-1; else coref = '-coref'

2. infer = '+infer' if Cj contains an NP whose referent can be
   inferred from an NP in Cj-1 on the basis of a pre-defined set of
   inference relations; else infer = '-infer'

3. global.pro = '+global.pro' if Cj contains a definite pronoun whose
   referent is mentioned in a previous clause up to the last boundary
   assigned by the algorithm; else global.pro = '-global.pro'

If a new FIC is not initiated in Pi+1, values for all three features
are 'NA'.

2 We previously used agreement by 4 subjects as the threshold for
boundaries; for j >= 4, agreement was significant at the .01% level
(Passonneau and Litman, 1993).

3 The cue phrases that occur in the corpus are shown as potential
values in Fig. 2.

Cue-prosody, which encodes a combination of prosodic and cue word
features, was motivated by an analysis of IR errors on our training
data, as described in section 4. Cue-prosody is 'complex' if:

1. before = '+sentence.final.contour'

2. pause = 'true'

3. And either:

   (a) cue1 = 'true', word1 ≠ 'and'

   (b) cue1 = 'true', word1 = 'and', cue2 = 'true', word2 ≠ 'and'

Else, cue-prosody has the same values as pause.

Fig. 3 illustrates how the first boundary site in Fig. 1 would be
coded using the features in Fig. 2.
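
The definition of cue-prosody can be restated as a small Python
function. This is a sketch only; the function name and the use of
Python booleans for the 'true'/'false' feature values are assumptions
made for illustration.

```python
def cue_prosody(before, pause, cue1, word1, cue2, word2):
    """Return 'complex', 'true' or 'false' following the three conditions above."""
    informative_cue = (cue1 and word1 != "and") or (
        cue1 and word1 == "and" and cue2 and word2 != "and"
    )
    if before == "+sentence.final.contour" and pause and informative_cue:
        return "complex"
    # Otherwise the feature simply mirrors the value of pause.
    return "true" if pause else "false"


# First boundary site of Fig. 1: a pause but no cue word.
print(cue_prosody("+sentence.final.contour", True, False, "NA", False, "NA"))
```

For the first site of Fig. 1 the function returns 'true', which agrees
with the cue-prosody value listed in Fig. 3.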

The prosodic and cue phrase features were motivated by previous
results in the literature. For example, phrases beginning discourse
segments were correlated with preceding pause duration in (Grosz and
Hirschberg, 1992; Hirschberg and Grosz, 1992). These and other studies
(e.g., (Hirschberg and Litman, 1993)) also found it useful to
distinguish between sentence and non-sentence final intonational
contours. Initial phrase position was correlated with discourse
signaling uses of cue words in (Hirschberg and Litman, 1993); a
potential correlation between discourse signaling uses of cue words
and adjacency patterns between cue words was also suggested. Finally,
(Litman, 1994) found that treating cue phrases individually rather
than as a class enhanced the results of (Hirschberg and Litman, 1993).

Passonneau (to appear) examined some of the few claims relating
discourse anaphoric noun phrases to global discourse structure in the
Pear corpus. Results included an absence of correlation of segmental
structure with centering (Grosz et al., 1983; Kameyama, 1986), and
poor correlation with the contrast between full noun phrases and
pronouns. As noted in (Passonneau and Litman, 1993), the NP features
largely reflect Passonneau's hypotheses that adjacent utterances are
more likely to contain expressions that corefer, or that are
inferentially linked, if they occur within the same segment; and that
a definite pronoun is more likely than a full NP to refer to an entity
that was mentioned in the current segment, if not in the previous
utterance.

4 The NP algorithm can assign multiple boundaries within one prosodic
phrase if the phrase contains multiple clauses; these very rare cases
are normalized (Passonneau and Litman, 1993).

3.3 Evaluation 

The segmentation algorithms presented in the next 
two sections were developed by examining only a 
training set of narratives. The algorithms are then 
evaluated by examining their performance in pre- 
dicting segmentation on a separate test set. We cur- 
rently use 10 narratives for training and 5 narratives 
for testing. (The remaining 5 narratives are reserved 
for future research.) The 10 training narratives 
range in length from 51 to 162 phrases (Avg.=101.4), 
or from 38 to 121 clauses (Avg.=76.8). The 5 test 
narratives range in length from 47 to 113 phrases 
(Avg.=87.4), or from 37 to 101 clauses (Avg.=69.0). 
The ratios of test to training data measured in narra- 
tives, prosodic phrases and clauses, respectively, are 
50.0%, 43.1% and 44.9%. For the machine learning algorithm we also
estimate performance using cross-validation (Weiss and Kulikowski,
1991), as detailed in section 5.

..Because he_i's looking at the girl.
[.75] (ZERO-PRONOUN_i) Falls over,

before       +s.f.c
after        -s.f.c
pause        true
duration     .75
cue1         false
word1        NA
cue2         false
word2        NA
coref, infer, global.pro
             (only a "+" survives for these three columns in this copy)
cue-prosody  true

Figure 3: Example feature coding of a potential boundary site.

                          Subjects
Algorithm        Boundary    Non-Boundary
Boundary            a             b
Non-Boundary        c             d

Recall    = a / (a + c)
Precision = a / (a + b)
Fallout   = b / (b + d)
Error     = (b + c) / (a + b + c + d)

Figure 4: Information retrieval metrics.

               Recall   Prec   Fall   Error   SumDev
Training Set     .63     .72    .06    .12      .83
Test Set         .64     .68    .07    .11      .86

Table 1: Average human performance.

To quantify algorithm performance, we use the information retrieval
metrics shown in Fig. 4. Recall is the ratio of correctly hypothesized
boundaries to target boundaries. Precision is the ratio of
hypothesized boundaries that are correct to the total hypothesized
boundaries. (Cf. Fig. 4 for fallout and error.) Ideal behavior would
be to identify all and only the target boundaries: the values for b
and c in Fig. 4 would thus both equal 0, representing no errors. The
ideal values for recall, precision, fallout, and error are 1, 1, 0,
and 0, while the worst values are 0, 0, 1, and 1. To get an intuitive
summary of overall performance, we also sum the deviation of the
observed value from the ideal value for each metric: (1-recall) +
(1-precision) + fallout + error. The summed deviation for perfect
performance is thus 0.
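
The metrics of Fig. 4 and the summed deviation can be computed directly
from the four cell counts. The Python sketch below assumes that every
denominator is non-zero; the function name and the example counts are
hypothetical and are not taken from our data.

```python
def ir_metrics(a, b, c, d):
    """Compute the Fig. 4 metrics from the cells of the 2x2 table.

    a: algorithm boundary, subjects boundary
    b: algorithm boundary, subjects non-boundary
    c: algorithm non-boundary, subjects boundary
    d: algorithm non-boundary, subjects non-boundary
    """
    recall = a / (a + c)
    precision = a / (a + b)
    fallout = b / (b + d)
    error = (b + c) / (a + b + c + d)
    sum_dev = (1 - recall) + (1 - precision) + fallout + error
    return {
        "recall": recall,
        "precision": precision,
        "fallout": fallout,
        "error": error,
        "sum_dev": sum_dev,
    }


# Hypothetical counts for a narrative with 100 potential boundary sites.
print(ir_metrics(a=6, b=4, c=4, d=86))
# Perfect performance: b = c = 0, so the summed deviation is 0.
print(ir_metrics(a=10, b=0, c=0, d=90)["sum_dev"])
```

The summed deviation ranges from 0 (ideal) to 4 (worst on all four
metrics), which is why values above 1 in the tables still represent
partial success.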

Finally, to interpret our quantitative results, we use the performance
of our human subjects as a target goal for the performance of our
algorithms (Gale et al., 1992). Table 1 shows the average human
performance for both the training and test sets of narratives. Note
that human performance is basically the same for both sets of
narratives. However, two factors prevent this performance from being
closer to ideal (e.g., recall and precision of 1). The first is the
wide variation in the number of boundaries that subjects used, as
discussed above. The second is the inherently fuzzy nature of boundary
location. We discuss this second issue at length in (Passonneau and
Litman, to appear), and present relaxed IR metrics that penalize near
misses less heavily in (Litman and Passonneau, 1995).



4 Hand Tuning 

To improve performance, we analyzed the two types of IR errors made by
the original NP algorithm (Passonneau and Litman, 1993) on the
training data. Type "b" errors (cf. Fig. 4), mis-classification of
non-boundaries, were reduced by changing the coding features
pertaining to clauses and NPs. Most "b" errors correlated with two
conditions used in the NP algorithm, identification of clauses and of
inferential links. The revision led to fewer clauses (more assignments
of 'NA' for the three NP features) and more inference relations. One
example of a change to clause coding is that formulaic utterances
having the structure of clauses, but which function like
interjections, are no longer recognized as independent clauses. These
include the phrases let's see, let me see, I don't know, you know when
they occur with no verb phrase argument. Other changes pertained to
sentence fragments, unexpected clausal arguments, and embedded speech.

Three types of inference relations linking successive clauses (Cj-1,
Cj) were added (originally there were 5 types (Passonneau, 1994)).
Now, a pronoun (e.g., it, that, this) in Cj referring to an action,
event or fact inferrable from Cj-1 links the two clauses. So does an
implicit argument, as in Fig. 5, where the missing argument of notice
is inferred to be the event of the pears falling. The third case is
where an NP in Cj is described as part of an event that results
directly from an event mentioned in Cj-1.

"C" type errors (cf. Fig. 4), mis-classification of boundaries, often
occurred where prosodic and cue features conflicted with NP features.
The original NP algorithm assigned boundaries wherever the three
values '-coref', '-infer', '-global.pro' (defined in section 3.2)
co-occurred, represented as the first conditional statement of Fig. 6.
Experiments led to the hypothesis that the most improvement came by
assigning a boundary if the cue-prosody feature had the value
'complex', even if the algorithm would not otherwise assign a
boundary, as shown in Fig. 6.

We refer to the original NP algorithm applied to the initial coding as
Condition 1, and the tuned algorithm applied to the enriched coding as
Condition 2. Table 2 presents the average IR scores across the
narratives in the training set for both conditions.
Reduction of "b" type errors raises precision, and 
lowers fallout and error rate. Reduction of "c" type 
errors raises recall, and lowers fallout and error rate. 
All scores improve in Condition 2, with precision and 
fallout showing the greatest relative improvement. 
The major difference from human performance is rel- 
atively poorer precision. 

The standard deviations in Table 2 are often close to 1/4 or 1/3 of
the reported averages. This indicates a large amount of variability in
the data, reflecting wide differences across narratives (speakers) in
the training set with respect to the distinctions recognized by the
algorithm. Although the high standard deviations show that the tuned
algorithm is not well fitted to each narrative, it is likely that it
is overspecialized to the training sample in the sense that test
narratives are likely to exhibit further variation.

Cl.   Phr.
      3.01   [1.1 [.7] A-nd] he's not really.. doesn't seem
             to be paying all that much attention
             [.55? because [.45]] you know the pears fall,
      3.02   and.. he doesn't really notice (0i),

Figure 5: Inferential link due to implicit argument.

if (coref = -coref and infer = -infer and global.pro = -global.pro)
   then boundary
elseif cue-prosody = complex then boundary
else non-boundary

Figure 6: Condition 2 algorithm.
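
The rule of Fig. 6 is compact enough to restate as a Python function.
This is a sketch for illustration; the function and parameter names
are hypothetical, and feature values are passed as the strings used in
Fig. 2.

```python
def condition2(coref, infer, global_pro, cue_prosody):
    """Classify a potential boundary site with the rule of Fig. 6."""
    # First conditional: the original NP algorithm.
    if coref == "-coref" and infer == "-infer" and global_pro == "-global.pro":
        return "boundary"
    # Added in Condition 2: a 'complex' cue-prosody value also yields a boundary.
    if cue_prosody == "complex":
        return "boundary"
    return "non-boundary"


print(condition2("-coref", "-infer", "-global.pro", "false"))  # boundary
print(condition2("+coref", "-infer", "-global.pro", "complex"))  # boundary
print(condition2("NA", "NA", "NA", "true"))  # non-boundary
```

The second test can only add boundaries to those assigned by the
original NP algorithm, which is how it targets "c" type errors.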

Table 3 shows the results of the hand tuned algorithm on the 5
randomly selected test narratives on both Conditions 1 and 2.
Condition 1 results, the untuned algorithm with the initial feature
set, are very similar to the training set except for worse precision.
Thus, despite the high standard deviations, 10 narratives seem to have
been a sufficient sample size for evaluating the initial NP algorithm.
Condition 2 results are better than condition 1 in both Table 2 and
Table 3. This is strong evidence that the tuned algorithm is a better
predictor of segment boundaries than the original NP algorithm.
Nevertheless, the test results of condition 2 are much worse than the
corresponding training results, particularly for precision (.44 versus
.62). This confirms that the tuned algorithm is over calibrated to the
training set.



5 Machine Learning 



Average        Recall   Prec   Fall   Error   SumDev
Condition 1      .42     .40    .14    .22     1.54
Std. Dev.        .17     .12    .06    .07      .34
Condition 2      .58     .62    .08    .14     1.02
Std. Dev.        .14     .10    .04    .05      .18

Table 2: Performance on training set.

Average        Recall   Prec   Fall   Error   SumDev
Condition 1      .44     .29    .16    .21     1.64
Std. Dev.        .18     .17    .07    .05      .32
Condition 2      .50     .44    .11    .17     1.34
Std. Dev.        .21     .06    .03    .04      .29

Table 3: Performance on test set.

We use the machine learning program C4.5 (Quinlan, 1993) to
automatically develop segmentation algorithms from our corpus of coded
narratives, where each potential boundary site has been classified and
represented as a set of linguistic features. The first input to C4.5
specifies the names of the classes to be learned (boundary and
non-boundary), and the names and potential values of a fixed set of
coding features (Fig. 2). The second input is the training data, i.e.,
a set of examples for which the class and feature values (as in
Fig. 3) are specified. Our training set of 10 narratives provides 1004
examples of potential boundary sites. The output of C4.5 is a
classification algorithm expressed as a decision tree, which predicts
the class of a potential boundary given its set of feature values.

Because machine learning makes it convenient to induce decision trees
under a wide variety of conditions, we have performed numerous
experiments, varying the number of features used to code the training
data, the definitions used for classifying a potential boundary site
as boundary or non-boundary[5] and the options available for running
the C4.5 program. Fig. 7 shows one of the highest-performing learned
decision trees from our experiments. This decision tree was learned
under the following conditions: all of the features shown in Fig. 2
were used to code the training data, boundaries were classified as
discussed in section 3.1, and C4.5 was run using only the default
options. The decision tree predicts the class of a potential boundary
site based on the features before, after, duration, cue1, word1,
coref, infer, and global.pro. Note that although not all available
features are used in the tree, the included features represent 3 of
the 4 general types of knowledge (prosody, cue phrases and noun
phrases). Each level of the tree specifies a test on a single feature,
with a branch for every possible outcome of the test.[6] A branch can
either lead to the assignment of a class, or to another test. For
example, the tree initially branches based on the value of the feature
before. If the value is '-sentence.final.contour' then the first
branch is taken and the potential boundary site is assigned the class
non-boundary. If the value of before is '+sentence.final.contour' then
the second branch is taken and the feature coref is tested.

5 (Litman and Passonneau, 1995) varies the number of subjects used to
determine boundaries.

if before = -sentence.final.contour then non-boundary
elseif before = +sentence.final.contour then
   if coref = NA then non-boundary
   elseif coref = +coref then
      if after = +sentence.final.contour then
         if duration <= 1.3 then non-boundary
         elseif duration > 1.3 then boundary
      elseif after = -sentence.final.contour then
         if word1 ∈ {also, basically, because, finally, first, like,
               meanwhile, no, oh, okay, only, see, so, well, where, NA}
            then non-boundary
         elseif word1 ∈ {anyway, but, now, or, then} then boundary
         elseif word1 = and then
            if duration <= 0.6 then non-boundary
            elseif duration > 0.6 then boundary
   elseif coref = -coref then
      if infer = +infer then non-boundary
      elseif infer = NA then boundary
      elseif infer = -infer then
         if after = -sentence.final.contour then boundary
         elseif after = +sentence.final.contour then
            if cue1 = true then
               if global.pro = NA then boundary
               elseif global.pro = -global.pro then boundary
               elseif global.pro = +global.pro then
                  if duration <= 0.65 then non-boundary
                  elseif duration > 0.65 then boundary
            elseif cue1 = false then
               if duration > 0.5 then non-boundary
               elseif duration <= 0.5 then
                  if duration <= 0.35 then non-boundary
                  elseif duration > 0.35 then boundary

Figure 7: Learned decision tree for segmentation.
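
To make the branching concrete, the tree of Fig. 7 can be transcribed
as a Python function. The sketch assumes that the branches of Fig. 7
nest in the order in which they are listed and that the duration tests
are of the form "at most t" versus "greater than t"; the function
name, the constant names and the dictionary representation of a coded
site are hypothetical.

```python
PLUS_SFC = "+sentence.final.contour"
MINUS_SFC = "-sentence.final.contour"
WORD1_BOUNDARY = {"anyway", "but", "now", "or", "then"}


def fig7_tree(site):
    """Classify one coded site, given as a dict of Fig. 2 feature values."""
    if site["before"] == MINUS_SFC or site["coref"] == "NA":
        return "non-boundary"
    dur = site["duration"]
    if site["coref"] == "+coref":
        if site["after"] == PLUS_SFC:
            return "boundary" if dur > 1.3 else "non-boundary"
        # after = -sentence.final.contour: branch on the first cue word
        if site["word1"] in WORD1_BOUNDARY:
            return "boundary"
        if site["word1"] == "and":
            return "boundary" if dur > 0.6 else "non-boundary"
        return "non-boundary"
    # coref = -coref
    if site["infer"] == "+infer":
        return "non-boundary"
    if site["infer"] == "NA" or site["after"] == MINUS_SFC:
        return "boundary"
    if site["cue1"]:
        if site["global.pro"] == "+global.pro":
            return "boundary" if dur > 0.65 else "non-boundary"
        return "boundary"
    # cue1 = false: only pauses longer than 0.35 and at most 0.5 give a boundary
    return "boundary" if 0.35 < dur <= 0.5 else "non-boundary"


# A hypothetical coded site, not taken from the corpus.
example = {
    "before": PLUS_SFC, "after": PLUS_SFC, "duration": 1.35,
    "cue1": False, "word1": "NA",
    "coref": "+coref", "infer": "-infer", "global.pro": "-global.pro",
}
print(fig7_tree(example))  # boundary
```

Written this way, it is easy to see that a site preceded by a phrase
without sentence-final intonation is never classified as a boundary by
this tree, whatever the other features say.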

6 The actual tree branches on every value of word1; the figure merges
these branches for clarity.

The performance of this learned decision tree averaged over the 10
training narratives is shown in Table 4, on the line labeled
"Learning 1". The line labeled "Learning 2" shows the results from
another machine learning experiment, in which one of the default C4.5
options used in "Learning 1" is overridden. The "Learning 2" tree (not
shown due to space restrictions) is more complex than the tree of
Fig. 7, but has slightly better performance. Note that "Learning 1"
performance is comparable to human performance (Table 1), while
"Learning 2" is slightly better than humans. The results obtained via
machine learning are also somewhat better than the results obtained
using hand tuning, particularly with respect to precision
("Condition 2" in Table 2), and are a great improvement over the
original NP results ("Condition 1" in Table 2).

Average        Recall   Prec   Fall   Error   SumDev
Learning 1       .54     .76    .04    .11      .85
Std. Dev.        .18     .12    .02    .04      .28
Learning 2       .59     .78    .03    .10      .76
Std. Dev.        .22     .12    .02    .04      .29

Table 4: Performance on training set.

The performance of the learned decision trees averaged over the 5 test
narratives is shown in Table 5. Comparison of Tables 4 and 5 shows
that, as with the hand tuning results (and as expected), average
performance is worse when applied to the testing rather than the
training data, particularly with respect to precision. However,
performance is an improvement over our previous best results
("Condition 1" in Table 3), and is comparable to ("Learning 1") or
very slightly better than ("Learning 2") the hand tuning results
("Condition 2" in Table 3).

Average        Recall   Prec   Fall   Error   SumDev
Learning 1       .43     .48    .08    .16     1.34
Std. Dev.        .21     .13    .03    .05      .36
Learning 2       .47     .50    .09    .16     1.27
Std. Dev.        .18     .16    .04    .07      .42

Table 5: Performance on test set.

We also use the resampling method of cross-validation (Weiss and
Kulikowski, 1991) to estimate performance, which averages results over
multiple partitions of a sample into test versus training data. We
performed 10 runs of the learning program, each using 9 of the 10
training narratives for that run's training set (for learning the
tree) and the remaining narrative for testing. Note that for each
iteration of the cross-validation, the learning process begins from
scratch and thus each training and testing set are still disjoint.
While this method does not make sense for humans, computers can truly
ignore previous iterations. For sample sizes in the hundreds (our 10
narratives provide 1004 examples) 10-fold cross-validation often
provides a better performance estimate than the hold-out method (Weiss
and Kulikowski, 1991). Results using cross-validation are shown in
Table 6, and are better than the estimates obtained using the hold-out
method (Table 5), with the major improvement coming from precision.
Because a different tree is learned on each iteration, the
cross-validation evaluates the learning method, not a particular
decision tree.

Average        Recall   Prec   Fall   Error   SumDev
Learning 1       .43     .63    .05    .15     1.14
Std. Dev.        .19     .16    .03    .03      .24
Learning 2       .46     .61    .07    .15     1.15
Std. Dev.        .20     .14    .04    .03      .21

Table 6: Using 10-fold cross-validation.
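
The cross-validation procedure, with one narrative held out per run,
can be sketched as follows. C4.5 itself is not reproduced here: a
majority-class learner stands in as a placeholder, and the data
layout, function names and toy examples are assumptions made for
illustration only.

```python
from collections import Counter


def leave_one_narrative_out(examples):
    """Yield (held_out_id, train, test), holding out one narrative per fold."""
    ids = sorted({ex["narrative"] for ex in examples})
    for held_out in ids:
        train = [ex for ex in examples if ex["narrative"] != held_out]
        test = [ex for ex in examples if ex["narrative"] == held_out]
        yield held_out, train, test


def majority_learner(train):
    """Placeholder for C4.5: always predict the most frequent training class."""
    label = Counter(ex["label"] for ex in train).most_common(1)[0][0]
    return lambda ex: label


def cross_validate(examples, learner):
    """Average the per-narrative error rate over all folds."""
    error_rates = []
    for _, train, test in leave_one_narrative_out(examples):
        model = learner(train)  # learning starts from scratch on every fold
        wrong = sum(model(ex) != ex["label"] for ex in test)
        error_rates.append(wrong / len(test))
    return sum(error_rates) / len(error_rates)


# Toy data: 3 hypothetical narratives with 4 sites each (b = boundary).
toy = [
    {"narrative": n, "label": "boundary" if ch == "b" else "non-boundary"}
    for n, labels in {1: "bnnn", 2: "nnbn", 3: "nbnn"}.items()
    for ch in labels
]
print(cross_validate(toy, majority_learner))  # 0.25
```

Holding out whole narratives, rather than individual boundary sites,
keeps all examples from one speaker on the same side of each
train/test split.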

6 Conclusion 

We have presented two methods for developing seg- 
mentation hypotheses using multiple linguistic fea- 
tures. The first method hand tunes features and 
algorithms based on analysis of training errors. The 
second method, machine learning, automatically in- 
duces decision trees from coded corpora. Both meth- 
ods rely on an enriched set of input features com- 
pared to our previous work. With each method, 
we have achieved marked improvements in perfor- 
mance compared to our previous work and are ap- 
proaching human performance. Note that quantita- 
tively, the machine learning results are slightly bet- 
ter than the hand tuning results. The main differ- 
ence on average performance is the higher precision 
of the automated algorithm. Furthermore, note that 
the machine learning algorithm used the changes to 
the coding features that resulted from the tuning 
methods. This suggests that hand tuning is a use- 
ful method for understanding how to best code the 



data, while machine learning provides an effective 
(and automatic) way to produce an algorithm given 
a good feature representation. 

Our results lend further support to the hypoth- 
esis that linguistic devices correlate with discourse 
structure (cf. section 2.1), which itself has practi- 
cal import. Understanding systems could infer seg- 
ments as a step towards producing summaries, while 
generation systems could signal segments to increase 
comprehensibility. Our results also suggest that to
best identify or convey segment boundaries, systems 
will need to exploit multiple signals simultaneously. 

We plan to continue our experiments by further 
merging the automated and analytic techniques, and 
evaluating new algorithms on our final test corpus. 
Because we have already used cross-validation, we 
do not anticipate significant degradation on new test 
narratives. An important area for future research 
is to develop principled methods for identifying dis- 
tinct speaker strategies pertaining to how they signal 
segments. Performance of individual speakers varies 
widely as shown by the high standard deviations in 
our tables. The original NP, hand tuned, and ma- 
chine learning algorithms all do relatively poorly on 
narrative 16 and relatively well on 11 (both in the 
test set) under all conditions. This lends support 
to the hypothesis that there may be consistent dif- 
ferences among speakers regarding strategies for sig- 
naling shifts in global discourse structure. 